Multi-step Off-policy Learning Without Importance Sampling Ratios
Authors
Abstract
To estimate the value functions of policies from exploratory data, most model-free off-policy algorithms rely on importance sampling, where the use of importance sampling ratios often leads to estimates with severe variance. It is thus desirable to learn off-policy without using the ratios. However, such an algorithm does not exist for multi-step learning with function approximation. In this paper, we introduce the first such algorithm based on temporal-difference (TD) learning updates. We show that an explicit use of importance sampling ratios can be eliminated by varying the amount of bootstrapping in TD updates in an action-dependent manner. Our new algorithm achieves stability using a two-timescale gradient-based TD update. A prior algorithm based on lookup table representation called Tree Backup can also be retrieved using action-dependent bootstrapping, becoming a special case of our algorithm. In two challenging off-policy tasks, we demonstrate that our algorithm is stable, effectively avoids the large variance issue, and can perform substantially better than its state-of-the-art counterpart.
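As a rough illustration of the action-dependent bootstrapping idea described in the abstract, the Python sketch below (with hypothetical names such as nu and action_dependent_decay) shows how choosing the bootstrapping parameter proportional to the behavior probability cancels the importance sampling ratio in a trace-style decay factor. It is a simplified, one-line-per-step illustration and not the paper's two-timescale gradient-TD algorithm.

```python
# Illustrative sketch only (hypothetical names; not the paper's exact
# algorithm, which uses a two-timescale gradient-TD update): how an
# action-dependent bootstrapping parameter can absorb the importance
# sampling ratio rho = pi(a|s) / mu(a|s) in a multi-step TD-style decay.

def ratio_based_decay(lam, pi_a, mu_a):
    """Conventional off-policy decay lambda * rho; explodes when mu(a|s) is small."""
    rho = pi_a / mu_a
    return lam * rho

def action_dependent_decay(nu, pi_a, mu_a):
    """Choose lambda(s, a) = nu(s, a) * mu(a|s); then lambda * rho = nu * pi(a|s)."""
    lam_sa = nu * mu_a                 # action-dependent bootstrapping parameter
    return lam_sa * (pi_a / mu_a)      # algebraically equal to nu * pi_a: no ratio left

pi_a, mu_a = 0.9, 0.05                 # target vs. behavior probability of the taken action
print(ratio_based_decay(0.9, pi_a, mu_a))       # 16.2 -- large, high-variance factor
print(action_dependent_decay(1.0, pi_a, mu_a))  # 0.9  -- bounded by pi(a|s) <= 1
# With nu = 1, i.e. lambda(s, a) = mu(a|s), the decay equals pi(a|s):
# the Tree Backup decay, consistent with Tree Backup being a special case.
```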
Similar resources
Multi-step Reinforcement Learning: A Unifying Algorithm
Unifying seemingly disparate algorithmic ideas to produce better performing algorithms has been a longstanding goal in reinforcement learning. As a primary example, TD(λ) elegantly unifies one-step TD prediction with Monte Carlo methods through the use of eligibility traces and the trace-decay parameter λ. Currently, there are a multitude of algorithms that can be used to perform TD control, in...
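To make the role of the trace-decay parameter λ mentioned above concrete, here is a minimal linear TD(λ) step with accumulating traces. The names (w, e, phi_s) are placeholders and this is the generic textbook form, not the unifying control algorithm proposed in the cited paper.

```python
import numpy as np

def td_lambda_step(w, e, phi_s, r, phi_next, gamma, lam, alpha):
    delta = r + gamma * w @ phi_next - w @ phi_s   # one-step TD error
    e = gamma * lam * e + phi_s                    # lam = 0 -> one-step TD; lam = 1 -> Monte Carlo-like
    w = w + alpha * delta * e
    return w, e

# toy usage with one-hot features
w, e = np.zeros(4), np.zeros(4)
w, e = td_lambda_step(w, e, np.eye(4)[0], 1.0, np.eye(4)[1], 0.99, 0.8, 0.1)
```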
Off-policy learning based on weighted importance sampling with linear computational complexity
Importance sampling is an essential component of model-free off-policy learning algorithms. Weighted importance sampling (WIS) is generally considered superior to ordinary importance sampling but, when combined with function approximation, it has hitherto required computational complexity that is O(n²) or more in the number of features. In this paper we introduce new off-policy learning algorit...
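For readers unfamiliar with the distinction, the toy comparison below contrasts ordinary importance sampling (OIS) with weighted importance sampling (WIS) under the assumption of per-trajectory importance ratios; it only illustrates the two estimators and not the cited paper's O(n) function-approximation algorithm.

```python
import numpy as np

def ois_estimate(rhos, returns):
    return np.mean(rhos * returns)                  # unbiased, can have very high variance

def wis_estimate(rhos, returns):
    return np.sum(rhos * returns) / np.sum(rhos)    # biased but typically much lower variance

rhos = np.array([0.1, 0.2, 8.0])   # per-trajectory importance ratios (hypothetical)
G = np.array([1.0, 0.5, 2.0])      # corresponding returns (hypothetical)
print(ois_estimate(rhos, G), wis_estimate(rhos, G))
```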
Eligibility Traces for Off-Policy Policy Evaluation
Eligibility traces have been shown to speed reinforcement learning, to make it more robust to hidden states, and to provide a link between Monte Carlo and temporal-difference methods. Here we generalize eligibility traces to off-policy learning, in which one learns about a policy different from the policy that generates the data. Off-policy methods can greatly multiply learning, as many policie...
Multi-Batch Experience Replay for Fast Convergence of Continuous Action Control
Policy gradient methods for direct policy optimization are widely considered to obtain optimal policies in continuous Markov decision process (MDP) environments. However, policy gradient methods require exponentially many samples as the dimension of the action space increases. Thus, off-policy learning with experience replay is proposed to enable the agent to learn by using samples of other pol...
A Convergent O(n) Algorithm for Off-policy Temporal-difference Learning with Linear Function Approximation
We introduce the first temporal-difference learning algorithm that is stable with linear function approximation and off-policy training, for any finite Markov decision process, behavior policy, and target policy, and whose complexity scales linearly in the number of parameters. We consider an i.i.d. policy-evaluation setting in which the data need not come from on-policy experience. The gradien...
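The following is a hedged sketch of a gradient-TD-style two-timescale update with linear features, intended only to illustrate the linear per-step cost noted in the title; theta, w, alpha, and beta are placeholder names and the update is not claimed to match the cited paper's exact derivation.

```python
import numpy as np

def gradient_td_step(theta, w, phi, r, phi_next, gamma, alpha, beta):
    delta = r + gamma * theta @ phi_next - theta @ phi            # TD error
    theta = theta + alpha * (phi - gamma * phi_next) * (phi @ w)  # slow timescale (main weights)
    w = w + beta * (delta * phi - w)                              # fast timescale (auxiliary weights)
    return theta, w                                               # every operation is O(n) in the features
```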
Journal: CoRR
Volume: abs/1702.03006
Issue:
Pages: -
Publication date: 2017